Data Science Exam: Part 1

Author

Flemming Christensen

Published

October 6, 2024

library(readr)
library(rvest)

Attaching package: 'rvest'
The following object is masked from 'package:readr':

    guess_encoding
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ purrr     1.0.2
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter()         masks stats::filter()
✖ rvest::guess_encoding() masks readr::guess_encoding()
✖ dplyr::lag()            masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(dplyr)
library(ggplot2)
library(plotly)

Attaching package: 'plotly'

The following object is masked from 'package:ggplot2':

    last_plot

The following object is masked from 'package:stats':

    filter

The following object is masked from 'package:graphics':

    layout

Introduction

This exam project consists of 24 exercises, each of which must be solved and justified.

important info

Some steps have their output hidden so that only the end results are visible.

Exercise 1

Read in the data in the file world_population.csv and select/deselect and rename columns so you end up with a tibble (tbl) named wpop_full with 266 rows and 65 columns with names as shown in the output below (the last column being 2022). Hint: Use skip in read_csv to avoid header lines not containing data or names of data.

import data from csv

data_dir <- here::here("world_population.csv") # path to the dataset, relative to the project root

wpop_raw_dat = suppressMessages(read_csv(data_dir, skip = 3)) # Read csv file and skip the last update info and empty rows
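
To see what skip does without the real file, a minimal sketch on inline text can help. The metadata lines and numbers below are made up for illustration and only mimic the general shape of such a file; I() tells readr to treat the string as literal data (this assumes readr 2.0 or later).

```r
library(readr)

# Hypothetical miniature of a file with metadata lines above the real header
csv_text <- paste(
  "Data Source,World Development Indicators",
  "Last Updated Date,2024-06-28",
  "Country Name,Country Code,1960,1961",
  "Aruba,ABW,54600,55800",
  "Andorra,AND,9400,10200",
  sep = "\n"
)

# skip = 2 drops the two metadata lines, so the third line is used as the header
demo <- read_csv(I(csv_text), skip = 2, show_col_types = FALSE)

names(demo)
```

If the header row ends up as the first data row instead, that is usually a sign that skip is off by one for the file at hand.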

Add column names and renaming

colnames(wpop_raw_dat) <- as.character(unlist(wpop_raw_dat[1,])) # Make the first row the column names

# The next two steps remove the leftover first row, which is now duplicated in the column names

wpop_raw_dat <- wpop_raw_dat[-1, ] # Remove the first row

rownames(wpop_raw_dat) <- NULL # Reset row names

head(wpop_raw_dat) # Confirm that the data set now has column names

wpop_raw_dat <- wpop_raw_dat |> # Rename the country name and country code columns
  rename(
    country = `Country Name`,
    code = `Country Code`
  )

head(wpop_raw_dat) # Confirm that the data structure is correct

Remove unwanted columns

wpop_raw_dat <- wpop_raw_dat |> # Remove the columns "Indicator Name" and "Indicator Code"
  select(-`Indicator Name`, -`Indicator Code`)

#head(wpop_raw_dat) # confirm that columns have been removed.

wpop_full <- wpop_raw_dat |> # Remove year 2023 so that 2022 is last column
  select(-`2023`)

wpop_full result

wpop_full
# A tibble: 266 × 65
   country  code  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968`
   <chr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Aruba    ABW   5.46e4 5.58e4 5.67e4 5.75e4 5.82e4 5.88e4 5.93e4 5.95e4 5.95e4
 2 Africa … AFE   1.31e8 1.34e8 1.38e8 1.42e8 1.46e8 1.50e8 1.54e8 1.58e8 1.63e8
 3 Afghani… AFG   8.62e6 8.79e6 8.97e6 9.16e6 9.36e6 9.57e6 9.78e6 1.00e7 1.02e7
 4 Africa … AFW   9.73e7 9.93e7 1.01e8 1.04e8 1.06e8 1.08e8 1.11e8 1.13e8 1.16e8
 5 Angola   AGO   5.36e6 5.44e6 5.52e6 5.60e6 5.67e6 5.74e6 5.79e6 5.83e6 5.87e6
 6 Albania  ALB   1.61e6 1.66e6 1.71e6 1.76e6 1.81e6 1.86e6 1.91e6 1.97e6 2.02e6
 7 Andorra  AND   9.44e3 1.02e4 1.10e4 1.18e4 1.27e4 1.36e4 1.45e4 1.57e4 1.71e4
 8 Arab Wo… ARB   9.34e7 9.58e7 9.83e7 1.01e8 1.04e8 1.06e8 1.09e8 1.12e8 1.16e8
 9 United … ARE   1.33e5 1.41e5 1.49e5 1.57e5 1.65e5 1.74e5 1.83e5 1.91e5 2.14e5
10 Argenti… ARG   2.03e7 2.07e7 2.10e7 2.14e7 2.17e7 2.21e7 2.24e7 2.28e7 2.31e7
# ℹ 256 more rows
# ℹ 54 more variables: `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, …

Exercise 2

Use the package rvest to read in the list of country codes from the main table at ISO 3166-1 on Wikipedia and select/deselect and rename columns so you end up with a tibble (tbl) named iso_codes_all with 249 rows and 3 columns with names as shown in the output below.

Web scraping data from Wikipedia

url <- "https://en.wikipedia.org/wiki/ISO_3166-1"

webpage <- read_html(url)

tables <- webpage |> 
  html_nodes("table") |>  # Specify the CSS selector for the table. "html_nodes" takes all tables. "html_node" takes only the first.
  html_table()            # Convert the HTML table to a data frame

head(tables) # The ISO 3166-1 table is the second one

raw_dat <- tables[[2]] # Table 2 holds the ISO 3166-1 codes

head(raw_dat) # Note: the "[c]" in "Afghanistan[c]" is a footnote marker pointing to "Naming and disputes", so it is expected

remove and rename columns

colnames(raw_dat)

raw_dat <- raw_dat |>
  rename(
    name = `English short name  (using title case)`,
    iso3 = `Alpha-3 code`,
    independent = `Independent[b]`
  )

colnames(raw_dat) #check rename was done

iso_codes_all <- raw_dat |>
  select(name, iso3, independent)

iso_codes_all result

Note: the "[c]" in "Afghanistan[c]" is a footnote marker pointing to the "Naming and disputes" section of the Wikipedia page, so its presence in the name is expected.

iso_codes_all 
# A tibble: 249 × 3
   name                iso3  independent
   <chr>               <chr> <chr>      
 1 Afghanistan[c]      AFG   Yes        
 2 Åland Islands       ALA   No         
 3 Albania             ALB   Yes        
 4 Algeria             DZA   Yes        
 5 American Samoa      ASM   No         
 6 Andorra             AND   Yes        
 7 Angola              AGO   Yes        
 8 Anguilla            AIA   No         
 9 Antarctica          ATA   No         
10 Antigua and Barbuda ATG   Yes        
# ℹ 239 more rows

Exercise 3

Use filter() to extract the independent countries from iso_codes_all and save the result as iso_codes.

iso_codes <- iso_codes_all |>
  filter(independent == "Yes")

iso_codes
# A tibble: 194 × 3
   name                iso3  independent
   <chr>               <chr> <chr>      
 1 Afghanistan[c]      AFG   Yes        
 2 Albania             ALB   Yes        
 3 Algeria             DZA   Yes        
 4 Andorra             AND   Yes        
 5 Angola              AGO   Yes        
 6 Antigua and Barbuda ATG   Yes        
 7 Argentina           ARG   Yes        
 8 Armenia             ARM   Yes        
 9 Australia           AUS   Yes        
10 Austria             AUT   Yes        
# ℹ 184 more rows

Exercise 4

Use a suitable join (and/or filter) command to make a dataset wpop only containing those rows of wpop_full which have a matching ISO country code in iso_codes:

# returns all rows from wpop_full that have a match in iso_codes.
# based on the condition (by = c("code" = "iso3")).
# but doesn't include columns from iso_codes
wpop <- semi_join(wpop_full, iso_codes, by = c("code" = "iso3"))

wpop
# A tibble: 193 × 65
   country  code  `1960` `1961` `1962` `1963` `1964` `1965` `1966` `1967` `1968`
   <chr>    <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghani… AFG   8.62e6 8.79e6 8.97e6 9.16e6 9.36e6 9.57e6 9.78e6 1.00e7 1.02e7
 2 Angola   AGO   5.36e6 5.44e6 5.52e6 5.60e6 5.67e6 5.74e6 5.79e6 5.83e6 5.87e6
 3 Albania  ALB   1.61e6 1.66e6 1.71e6 1.76e6 1.81e6 1.86e6 1.91e6 1.97e6 2.02e6
 4 Andorra  AND   9.44e3 1.02e4 1.10e4 1.18e4 1.27e4 1.36e4 1.45e4 1.57e4 1.71e4
 5 United … ARE   1.33e5 1.41e5 1.49e5 1.57e5 1.65e5 1.74e5 1.83e5 1.91e5 2.14e5
 6 Argenti… ARG   2.03e7 2.07e7 2.10e7 2.14e7 2.17e7 2.21e7 2.24e7 2.28e7 2.31e7
 7 Armenia  ARM   1.90e6 1.97e6 2.04e6 2.11e6 2.17e6 2.23e6 2.30e6 2.36e6 2.42e6
 8 Antigua… ATG   5.53e4 5.62e4 5.70e4 5.78e4 5.87e4 5.96e4 6.06e4 6.16e4 6.27e4
 9 Austral… AUS   1.03e7 1.05e7 1.07e7 1.10e7 1.12e7 1.14e7 1.17e7 1.18e7 1.20e7
10 Austria  AUT   7.05e6 7.09e6 7.13e6 7.18e6 7.22e6 7.27e6 7.32e6 7.38e6 7.42e6
# ℹ 183 more rows
# ℹ 54 more variables: `1969` <dbl>, `1970` <dbl>, `1971` <dbl>, `1972` <dbl>,
#   `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>, `1977` <dbl>,
#   `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>, `1982` <dbl>,
#   `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>, `1987` <dbl>,
#   `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>, `1992` <dbl>,
#   `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, `1997` <dbl>, …
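
The difference between the filtering joins can be shown on two toy tables. This is a minimal sketch with hypothetical values (pop_demo and codes_demo are not part of the exam data): semi_join() keeps matching rows without adding columns, while anti_join() lists the rows that have no match.

```r
library(dplyr)

# Hypothetical toy tables (values are illustrative, not from the exam data)
pop_demo   <- tibble(code = c("ABW", "AFG", "ALB"), `1960` = c(5, 8, 2))
codes_demo <- tibble(iso3 = c("AFG", "ALB", "DZA"), independent = "Yes")

# semi_join(): rows of pop_demo that have a match, with the columns of pop_demo only
kept <- semi_join(pop_demo, codes_demo, by = c("code" = "iso3"))

# anti_join(): rows of codes_demo that have NO match in pop_demo
unmatched <- anti_join(codes_demo, pop_demo, by = c("iso3" = "code"))

kept$code
unmatched$iso3
```

Applied to the real tables, the same anti_join() pattern would show which of the 194 codes in iso_codes has no row in wpop_full, which is why wpop has 193 rows.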

Exercise 5

Show the countries/areas which have the same ISO country code in both wpop and iso_codes but different (spellings of) country names.

join_data <- inner_join(wpop, iso_codes, by = c("code" = "iso3"))

different_spelling <- filter(join_data, country != name)

result <- different_spelling |> 
  select(name, code, country)

result
# A tibble: 28 × 3
   name                              code  country              
   <chr>                             <chr> <chr>                
 1 Afghanistan[c]                    AFG   Afghanistan          
 2 Bahamas                           BHS   Bahamas, The         
 3 Bolivia, Plurinational State of   BOL   Bolivia              
 4 China[c]                          CHN   China                
 5 Côte d'Ivoire                     CIV   Cote d'Ivoire        
 6 Congo, Democratic Republic of the COD   Congo, Dem. Rep.     
 7 Congo                             COG   Congo, Rep.          
 8 Cyprus[c]                         CYP   Cyprus               
 9 Egypt                             EGY   Egypt, Arab Rep.     
10 Micronesia, Federated States of   FSM   Micronesia, Fed. Sts.
# ℹ 18 more rows
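
Some of the mismatches above come only from the footnote markers in the Wikipedia names (for example "Afghanistan[c]" versus "Afghanistan"). The exercise does not ask for it, but as a hedged sketch such markers could be stripped with stringr before comparing; the regular expression below assumes the marker is a single lowercase letter in square brackets at the end of the name.

```r
library(stringr)

# Names copied from the output above
iso_names <- c("Afghanistan[c]", "China[c]", "Bahamas")

# Remove a trailing marker: a literal "[", one lowercase letter, a literal "]"
clean_names <- str_remove(iso_names, "\\[[a-z]\\]$")

clean_names
```

Names without a marker, such as "Bahamas", pass through unchanged, so real spelling differences would still be reported.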

Exercise 6

Use the package rvest to read in the list of countries with corresponding continent codes from the main table at List of sovereign states and dependent territories by continent and select/deselect and rename columns so you end up with a tibble (tbl) named continents with 253 rows and 2 columns with names as shown in the output below.

Important hint: you need convert = FALSE in html_table() to avoid the text string "NA" (North America) to be interpreted as missing data (Not Available).
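
The effect of convert = FALSE can be checked offline on a tiny hand-written table. This is a minimal sketch: the HTML below is made up and only reuses the column names CC and a-3 from the Wikipedia table.

```r
library(rvest)

# Hypothetical two-row table built from a string instead of a web page
page <- minimal_html('
  <table>
    <tr><th>CC</th><th>a-3</th></tr>
    <tr><td>NA</td><td>ATG</td></tr>
    <tr><td>EU</td><td>ALB</td></tr>
  </table>')

tbl_node <- html_element(page, "table")

converted <- html_table(tbl_node)                  # default: convert = TRUE
as_text   <- html_table(tbl_node, convert = FALSE) # keep every cell as text

is.na(converted$CC[1])  # the string "NA" was turned into a missing value
as_text$CC[1]           # the string "NA" is preserved
```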

Web scraping data from Wikipedia

url <- "https://en.wikipedia.org/wiki/List_of_sovereign_states_and_dependent_territories_by_continent_(data_file)"

webpage <- read_html(url)

tables <- webpage |> 
  html_nodes("table") |> # Specify the CSS selector for the table.
  html_table(convert = FALSE)

print(tables)

raw_continents <- tables[[3]]

raw_continents

remove and rename columns

colnames(raw_continents)

raw_continents <- raw_continents |>
  rename(
    continent = `CC`,
    iso3 = `a-3`
  )

colnames(raw_continents) #check rename was done

continents <- raw_continents |>
  select(continent, iso3)

Result

continents
# A tibble: 253 × 2
   continent iso3 
   <chr>     <chr>
 1 AS        AFG  
 2 EU        ALB  
 3 AN        ATA  
 4 AF        DZA  
 5 OC        ASM  
 6 EU        AND  
 7 AF        AGO  
 8 NA        ATG  
 9 AS        AZE  
10 SA        ARG  
# ℹ 243 more rows

Exercise 7

Make a new dataset wpop2 by extending wpop with the extra column continent from the continents data (possibly using relocate() to move the continent column to the left to see it more clearly).

join_data <- inner_join(wpop, continents, by = c("code" = "iso3")) #join wpop and continent table.

wpop2 <- join_data |>
  relocate(continent, .before = country) # set continent to left of country

wpop2
# A tibble: 193 × 66
   continent country      code  `1960` `1961` `1962` `1963` `1964` `1965` `1966`
   <chr>     <chr>        <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 AS        Afghanistan  AFG   8.62e6 8.79e6 8.97e6 9.16e6 9.36e6 9.57e6 9.78e6
 2 AF        Angola       AGO   5.36e6 5.44e6 5.52e6 5.60e6 5.67e6 5.74e6 5.79e6
 3 EU        Albania      ALB   1.61e6 1.66e6 1.71e6 1.76e6 1.81e6 1.86e6 1.91e6
 4 EU        Andorra      AND   9.44e3 1.02e4 1.10e4 1.18e4 1.27e4 1.36e4 1.45e4
 5 AS        United Arab… ARE   1.33e5 1.41e5 1.49e5 1.57e5 1.65e5 1.74e5 1.83e5
 6 SA        Argentina    ARG   2.03e7 2.07e7 2.10e7 2.14e7 2.17e7 2.21e7 2.24e7
 7 AS        Armenia      ARM   1.90e6 1.97e6 2.04e6 2.11e6 2.17e6 2.23e6 2.30e6
 8 NA        Antigua and… ATG   5.53e4 5.62e4 5.70e4 5.78e4 5.87e4 5.96e4 6.06e4
 9 OC        Australia    AUS   1.03e7 1.05e7 1.07e7 1.10e7 1.12e7 1.14e7 1.17e7
10 EU        Austria      AUT   7.05e6 7.09e6 7.13e6 7.18e6 7.22e6 7.27e6 7.32e6
# ℹ 183 more rows
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …

Exercise 8

Use pivot_longer() to reshape wpop2 into “long format” with columns as shown below (in particular make sure year is a numeric variable) and call the resulting tibble pop_long.

pop_long <- wpop2 |>
  pivot_longer(cols = `1960`:`2022`,  # Columns to pivot (year columns)
               names_to = "year",     # New column for years
               values_to = "pop")     # New column for population values

pop_long <- pop_long |>
  mutate(year = as.numeric(year))  # Convert year to numeric

pop_long
# A tibble: 12,159 × 5
   continent country     code   year      pop
   <chr>     <chr>       <chr> <dbl>    <dbl>
 1 AS        Afghanistan AFG    1960  8622466
 2 AS        Afghanistan AFG    1961  8790140
 3 AS        Afghanistan AFG    1962  8969047
 4 AS        Afghanistan AFG    1963  9157465
 5 AS        Afghanistan AFG    1964  9355514
 6 AS        Afghanistan AFG    1965  9565147
 7 AS        Afghanistan AFG    1966  9783147
 8 AS        Afghanistan AFG    1967 10010030
 9 AS        Afghanistan AFG    1968 10247780
10 AS        Afghanistan AFG    1969 10494489
# ℹ 12,149 more rows
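
As an alternative to the separate mutate() step, pivot_longer() can convert the new name column while reshaping via its names_transform argument. A minimal sketch on a hypothetical toy table (wide_demo is not part of the exam data):

```r
library(tidyr)
library(tibble)

# Hypothetical wide table with one column per year
wide_demo <- tibble(country = c("A", "B"), `1960` = c(10, 20), `1961` = c(11, 22))

long_demo <- pivot_longer(
  wide_demo,
  cols = `1960`:`1961`,
  names_to = "year",
  values_to = "pop",
  names_transform = list(year = as.numeric)  # convert the year names while pivoting
)

long_demo
```

Both approaches should give the same result; the two-step version used above is arguably easier to read.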

Exercise 9

Make a line plot showing the population over all the years in the data with one line per country with semi-transparent lines.

ggplot(pop_long, aes(x = year, y = pop, group = country)) +
  geom_line(alpha = 0.2) +
  labs(title = "Population Growth Over Time by Country",
       x = "Year",
       y = "Population") +
  scale_x_continuous(breaks = seq(1960, 2022, by = 20))

Exercise 10

Use the code below to rescale each country’s population size to a population index which is 1 for every country in 1960. An index value of e.g. 2 would mean that the population size of that country has doubled since 1960.

pop_index_data <- pop_long |> 
  group_by(country,continent,code) |> 
  mutate(pop_index = pop/pop[1])
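
Note that pop[1] is the 1960 value only because each country's rows are ordered by year. As a hedged sketch on hypothetical toy data, the base year can also be picked explicitly, which gives the same index even when the rows are not sorted:

```r
library(dplyr)

# Hypothetical toy data, deliberately NOT sorted by year for country A
pop_demo <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  year    = c(1961, 1960, 1962, 1960, 1961, 1962),
  pop     = c(110, 100, 130, 50, 75, 100)
)

index_demo <- pop_demo |>
  group_by(country) |>
  mutate(pop_index = pop / pop[year == 1960]) |>  # divide by the 1960 value, wherever it sits
  ungroup()

index_demo
```

Here country B ends at an index of 2, meaning its population doubled between 1960 and 1962 in the toy data.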

Exercise 11

Make a line plot showing the indexed population numbers over all the years in the data with one line per country with semi-transparent lines.

plt <- ggplot(pop_index_data, aes(x = year, y = pop_index, group = country)) +
  geom_line(alpha = 0.2) +
  labs(title = "Index 2 = double pop size since 1960\nIndex 6 = 6 times the pop size since 1960",
       x = "Year",
       y = "Population index for each country") +
  scale_x_continuous(breaks = seq(1960, 2022, by = 20))

interactive_plot <- ggplotly(plt) # Make the plot interactive

interactive_plot # Display the interactive plot

Exercise 12

Identify the two countries with extreme population indices and make a new line plot showing the indexed population numbers over all the years in the data with one line per country with semi-transparent lines without these two countries.

Identify two countries with extreme population indices

# Method 1: hover over the interactive plot above, which shows that United Arab Emirates and Qatar have the extreme population indices.

# Method 2: filter on a threshold for the index.
extreme_pop_indices <- pop_index_data |>
  filter(pop_index > 60) |> # 60 is chosen because only two countries rise above it, so they can be considered extreme
  select(country, continent, code) |>
  distinct()

extreme_pop_indices # United Arab Emirates and Qatar must be removed
# A tibble: 2 × 3
# Groups:   country, continent, code [2]
  country              continent code 
  <chr>                <chr>     <chr>
1 United Arab Emirates AS        ARE  
2 Qatar                AS        QAT  
pop_index_data <- pop_index_data |>
  filter(!country %in% c("United Arab Emirates", "Qatar"))

new plot without the extreme population indices

plt <- ggplot(pop_index_data, aes(x = year, y = pop_index, group = country)) +
  geom_line(alpha = 0.2) +
  labs(title = "Without Qatar and UAE",
       x = "Year",
       y = "Population index for each country") +
  scale_x_continuous(breaks = seq(1960, 2022, by = 20))

interactive_plot <- ggplotly(plt) # Make the plot interactive

interactive_plot # Display the interactive plot

Exercise 13

Run the following command and describe in a few words what the result growth_long is.

growth_long <- pop_long |> 
  group_by(country,continent,code) |>                 # groups country first then continent and finally code. 
  reframe(pop_start = pop[1:(length(pop)-1)],         # take all elements of "pop" except last.
          pop_end = pop[2:length(pop)],               # take all elements of "pop" except first.
          growth = 100*(pop_end-pop_start)/pop_start, # Calculate the growth rate between years "in %".
          end_year = year[2:length(year)])            # End of year of each interval "remove 1960"

growth_long
# A tibble: 11,966 × 7
   country     continent code  pop_start  pop_end growth end_year
   <chr>       <chr>     <chr>     <dbl>    <dbl>  <dbl>    <dbl>
 1 Afghanistan AS        AFG     8622466  8790140   1.94     1961
 2 Afghanistan AS        AFG     8790140  8969047   2.04     1962
 3 Afghanistan AS        AFG     8969047  9157465   2.10     1963
 4 Afghanistan AS        AFG     9157465  9355514   2.16     1964
 5 Afghanistan AS        AFG     9355514  9565147   2.24     1965
 6 Afghanistan AS        AFG     9565147  9783147   2.28     1966
 7 Afghanistan AS        AFG     9783147 10010030   2.32     1967
 8 Afghanistan AS        AFG    10010030 10247780   2.38     1968
 9 Afghanistan AS        AFG    10247780 10494489   2.41     1969
10 Afghanistan AS        AFG    10494489 10752971   2.46     1970
# ℹ 11,956 more rows
# Explanation in a few words:

# growth_long has one row per country and pair of consecutive years.
# pop_start is the population in the first year of the pair (1960 in each country's first row).
# pop_end is the population one year later (e.g. 1961 for the 1960-1961 row); it becomes pop_start of the next row.
# growth is the percentage change from pop_start to pop_end, and end_year is the year of pop_end.
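
The growth column follows growth = 100 * (pop_end - pop_start) / pop_start. As a cross-check, the same quantity can be sketched with dplyr::lag() instead of index ranges; the toy numbers below are hypothetical and chosen so that the growth is exactly 10% per year.

```r
library(dplyr)

# Hypothetical toy data for one country
pop_demo <- tibble(
  country = c("A", "A", "A"),
  year    = c(1960, 1961, 1962),
  pop     = c(100, 110, 121)
)

growth_demo <- pop_demo |>
  group_by(country) |>
  mutate(pop_start = dplyr::lag(pop),                       # previous year's population
         growth = 100 * (pop - pop_start) / pop_start) |>   # percentage change
  ungroup() |>
  filter(!is.na(growth)) |>                                 # 1960 has no previous year
  select(country, pop_start, pop_end = pop, growth, end_year = year)

growth_demo
```

Like the reframe() version, this yields one row fewer per country than the input, because the first year has nothing to compare against.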

Exercise 14

Make a line plot showing the population over all the years in the data with one line per country with semi-transparent lines.

ggplot(growth_long, aes(end_year, growth, group = country))+
  geom_line(alpha = 0.2) +
    labs(title = "Population growth per year in percentage all countries",
         x = "Year",
         y = "Year-to-year population growth in pct") +
    scale_x_continuous(breaks = seq(1960, 2022, by = 20))

Exercise 15

Make a similar graphic as above but with one panel/facet per continent.

ggplot(growth_long, aes(end_year, growth, group = country))+
  geom_line(alpha = 0.2) +
    labs(title = "Population growth per year in percentage for each country, per continent",
         x = "Year",
         y = "Year-to-year population growth in pct") +
    scale_x_continuous(breaks = seq(1960, 2022, by = 20)) +
  facet_wrap(~ continent)  # Create one panel per continent

Exercise 16

For each country find both the largest positive and the smallest (most negative) growth over the years in the data so you end up with a tibble (tbl) named growth_range with 193 rows and 3 columns with names as shown in the output below. hint: group_by() and summarise() are your friends.

growth_range <- growth_long |>
  group_by(country) |>                          # group by country
  reframe(max_growth = signif(max(growth),3),   # largest positive growth; signif() rounds to the given number of significant digits
          min_growth = signif(min(growth),3))   # smallest (most negative) growth

growth_range
# A tibble: 193 × 3
   country             max_growth min_growth
   <chr>                    <dbl>      <dbl>
 1 Afghanistan              16.1     -10.7  
 2 Albania                   3.17     -1.21 
 3 Algeria                   4.93      1.36 
 4 Andorra                   8.47     -3.16 
 5 Angola                    3.83      0.698
 6 Antigua and Barbuda       2.05     -0.577
 7 Argentina                 1.72      0.256
 8 Armenia                   3.54     -3.28 
 9 Australia                 3.44      0.141
10 Austria                   1.13     -0.265
# ℹ 183 more rows
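
The hint mentions summarise(); since each group returns exactly one row here, it can be used in place of reframe(). A minimal sketch on hypothetical toy data (rounding with signif() left out for brevity):

```r
library(dplyr)

# Hypothetical toy growth rates for two countries
growth_demo <- tibble(
  country = c("A", "A", "A", "B", "B", "B"),
  growth  = c(1.5, -0.5, 2.0, 0.3, 0.1, 0.2)
)

range_demo <- growth_demo |>
  group_by(country) |>
  summarise(max_growth = max(growth),
            min_growth = min(growth),
            .groups = "drop")  # return an ungrouped tibble

range_demo
```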

Exercise 17

Find the 10 countries which have experienced the largest growth percentage of all at some point over the years in the data.

top10 <- growth_range |>
  arrange(desc(max_growth)) |> # Sort by max growth in descending order (largest to smallest)
  head(10)

top10
# A tibble: 10 × 3
   country              max_growth min_growth
   <chr>                     <dbl>      <dbl>
 1 Qatar                      21.4     -2.61 
 2 Kuwait                     21      -24.2  
 3 Seychelles                 20.8     -2.59 
 4 United Arab Emirates       19.9      0.782
 5 Rwanda                     18.1    -15.5  
 6 Afghanistan                16.1    -10.7  
 7 Lebanon                    14.1     -8.83 
 8 Somalia                    13.2     -4.52 
 9 Jordan                     12.5      1.23 
10 Oman                       11.3     -1.29 

Exercise 18

Find the 10 countries which have experienced the most negative growth percentage of all at some point over the years in the data.

bottom10 <- growth_range |>
  arrange(min_growth) |> # Sort by min growth in ascending order (smallest to largest)
  head(10)

bottom10
# A tibble: 10 × 3
   country                max_growth min_growth
   <chr>                       <dbl>      <dbl>
 1 Kuwait                     21         -24.2 
 2 Rwanda                     18.1       -15.5 
 3 Ukraine                     1.4       -13.3 
 4 Liberia                    10.7       -12.2 
 5 Afghanistan                16.1       -10.7 
 6 Lebanon                    14.1        -8.83
 7 Bosnia and Herzegovina      4.19       -7.78
 8 Syrian Arab Republic        6.54       -6.62
 9 Cambodia                    5.41       -6.25
10 Bulgaria                    0.963      -6   
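
The arrange() plus head() pattern used in Exercises 17 and 18 can also be written with slice_max() and slice_min(). A minimal sketch on a hypothetical four-row table, taking the top and bottom two instead of ten:

```r
library(dplyr)

# Hypothetical toy version of growth_range
range_demo <- tibble(
  country    = c("A", "B", "C", "D"),
  max_growth = c(3, 21, 8, 1),
  min_growth = c(-2, -24, 0.5, -6)
)

top2    <- slice_max(range_demo, max_growth, n = 2)  # largest max_growth first
bottom2 <- slice_min(range_demo, min_growth, n = 2)  # most negative min_growth first

top2$country
bottom2$country
```

One difference worth knowing: by default these functions keep ties (with_ties = TRUE), so they may return more than n rows, whereas head() always cuts at n.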

Exercise 19

Make a line plot showing the population over all the years in the data with different colours for each country represented in top10.

top10Years <- growth_long |>
  inner_join(top10, by = "country") |>
  select(country, growth, end_year) # select relevant columns

ggplot(top10Years, aes(end_year, growth, group = country, color = country))+
  geom_line(alpha = 0.6) +
    labs(title = "Top 10 countries: growth per year in percentage",
         x = "End year",
         y = "Year-to-year population growth in pct") +
    scale_x_continuous(breaks = seq(1960, 2022, by = 20))

Exercise 20

Use pivot_wider() to reshape growth_long to wide format with one column per year and call the result growth.

growth <- growth_long |>
  select(country, continent, code, end_year, growth) |> # keep only the relevant columns; otherwise pop_start and pop_end would prevent one row per country
  pivot_wider(
    names_from = end_year,
    values_from = growth
  ) |>
    arrange(country)

growth
# A tibble: 193 × 65
   country      continent code  `1961` `1962` `1963` `1964` `1965` `1966` `1967`
   <chr>        <chr>     <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1 Afghanistan  AS        AFG    1.94   2.04   2.10   2.16   2.24   2.28   2.32 
 2 Albania      EU        ALB    3.17   3.10   3.00   2.92   2.79   2.67   2.67 
 3 Algeria      AF        DZA    1.79   1.55   1.62   1.75   1.66   1.87   2.25 
 4 Andorra      EU        AND    8.19   7.81   7.49   7.19   6.88   7.25   8.24 
 5 Angola       AF        AGO    1.57   1.47   1.42   1.31   1.12   0.880  0.699
 6 Antigua and… NA        ATG    1.63   1.36   1.35   1.53   1.67   1.63   1.65 
 7 Argentina    SA        ARG    1.63   1.64   1.63   1.61   1.59   1.58   1.58 
 8 Armenia      AS        ARM    3.54   3.44   3.28   3.08   2.90   2.75   2.64 
 9 Australia    OC        AUS    2.01   2.47   1.94   1.98   1.98   2.31   1.27 
10 Austria      EU        AUT    0.550  0.615  0.644  0.669  0.652  0.704  0.750
# ℹ 183 more rows
# ℹ 55 more variables: `1968` <dbl>, `1969` <dbl>, `1970` <dbl>, `1971` <dbl>,
#   `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>, `1976` <dbl>,
#   `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>, `1981` <dbl>,
#   `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>, `1986` <dbl>,
#   `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>, `1991` <dbl>,
#   `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, `1996` <dbl>, …

Exercise 21

Make a hierarchical clustering of the growth data in 9 groups. The following hints can be used to achieve this:

Clustering is unsupervised learning: the input data have no labels, and the algorithm groups similar observations so that each group can be given a label.

1: Remove non-numerical columns from data

growth_Numerical <- Filter(is.numeric, growth) #only return numeric columns. 

# growth_Numerical <- growth |> select_if(is.numeric) could also be used.

growth_Numerical # check

2: Calculate distances

# Compute the distance matrix using Euclidean distance
distances <- dist(growth_Numerical, method = "euclidean")

3: Use hclust() to run the clustering algorithm

hc <- hclust(distances) # default linkage is method = "complete"
hc

4: Use cutree() to make the cluster labels

clusters <- cutree(hc, k = 9) # 9 groups
clusters

5: Use mutate to add the cluster label as a variable to the dataset growth

growth_clust <- growth |>
  mutate(cluster = clusters) |>
  relocate(cluster, .before = country)

Result

growth_clust
# A tibble: 193 × 66
   cluster country     continent code  `1961` `1962` `1963` `1964` `1965` `1966`
     <int> <chr>       <chr>     <chr>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>  <dbl>
 1       1 Afghanistan AS        AFG    1.94   2.04   2.10   2.16   2.24   2.28 
 2       2 Albania     EU        ALB    3.17   3.10   3.00   2.92   2.79   2.67 
 3       3 Algeria     AF        DZA    1.79   1.55   1.62   1.75   1.66   1.87 
 4       4 Andorra     EU        AND    8.19   7.81   7.49   7.19   6.88   7.25 
 5       3 Angola      AF        AGO    1.57   1.47   1.42   1.31   1.12   0.880
 6       2 Antigua an… NA        ATG    1.63   1.36   1.35   1.53   1.67   1.63 
 7       2 Argentina   SA        ARG    1.63   1.64   1.63   1.61   1.59   1.58 
 8       2 Armenia     AS        ARM    3.54   3.44   3.28   3.08   2.90   2.75 
 9       2 Australia   OC        AUS    2.01   2.47   1.94   1.98   1.98   2.31 
10       2 Austria     EU        AUT    0.550  0.615  0.644  0.669  0.652  0.704
# ℹ 183 more rows
# ℹ 56 more variables: `1967` <dbl>, `1968` <dbl>, `1969` <dbl>, `1970` <dbl>,
#   `1971` <dbl>, `1972` <dbl>, `1973` <dbl>, `1974` <dbl>, `1975` <dbl>,
#   `1976` <dbl>, `1977` <dbl>, `1978` <dbl>, `1979` <dbl>, `1980` <dbl>,
#   `1981` <dbl>, `1982` <dbl>, `1983` <dbl>, `1984` <dbl>, `1985` <dbl>,
#   `1986` <dbl>, `1987` <dbl>, `1988` <dbl>, `1989` <dbl>, `1990` <dbl>,
#   `1991` <dbl>, `1992` <dbl>, `1993` <dbl>, `1994` <dbl>, `1995` <dbl>, …
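
The dist(), hclust(), cutree() chain can be checked on a toy matrix where the grouping is obvious. This is a minimal sketch with made-up points; the linkage is written out as "complete", which is the default of hclust() and therefore matches the call used above.

```r
# Hypothetical points: a and b lie close together, c and d lie close together
m <- rbind(a = c(1, 1), b = c(1.1, 0.9), c = c(10, 10), d = c(10.2, 9.8))

d_demo  <- dist(m, method = "euclidean")        # pairwise Euclidean distances between rows
hc_demo <- hclust(d_demo, method = "complete")  # complete linkage (the default)

labels_demo <- cutree(hc_demo, k = 2)           # cut the tree into 2 groups
labels_demo
```

cutree() returns one integer label per row, in the row order of the input, which is why the labels can be attached to growth with mutate() without any join.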

Exercise 22

Use group_split() and lapply() similarly to below to extract the countries of each cluster and display them nicely with pander::pander():

growth_clust |> 
  group_split(cluster) |> 
  lapply(function(x) x$country) |> 
  pander::pander()
  • Afghanistan
  • Albania, Antigua and Barbuda, Argentina, Armenia, Australia, Austria, Azerbaijan, Barbados, Belarus, Belgium, Bosnia and Herzegovina, Bulgaria, Cabo Verde, Canada, Chile, China, Croatia, Cuba, Cyprus, Czechia, Denmark, Dominica, El Salvador, Estonia, Fiji, Finland, France, Georgia, Germany, Greece, Grenada, Guyana, Hungary, Iceland, Ireland, Italy, Jamaica, Japan, Kazakhstan, Korea, Dem. People’s Rep., Korea, Rep., Latvia, Liechtenstein, Lithuania, Luxembourg, Malta, Mauritius, Moldova, Monaco, Montenegro, Myanmar, Nauru, Netherlands, New Zealand, North Macedonia, Norway, Palau, Poland, Portugal, Romania, Russian Federation, Samoa, San Marino, Serbia, Slovak Republic, Slovenia, Spain, Sri Lanka, St. Kitts and Nevis, St. Lucia, St. Vincent and the Grenadines, Suriname, Sweden, Switzerland, Thailand, Tonga, Trinidad and Tobago, Ukraine, United Kingdom, United States and Uruguay
  • Algeria, Angola, Bahamas, The, Bahrain, Bangladesh, Belize, Benin, Bhutan, Bolivia, Botswana, Brazil, Brunei Darussalam, Burkina Faso, Burundi, Cambodia, Cameroon, Central African Republic, Chad, Colombia, Comoros, Congo, Dem. Rep., Congo, Rep., Costa Rica, Cote d’Ivoire, Dominican Republic, Ecuador, Egypt, Arab Rep., Equatorial Guinea, Eritrea, Eswatini, Ethiopia, Gabon, Gambia, The, Ghana, Guatemala, Guinea, Guinea-Bissau, Haiti, Honduras, India, Indonesia, Iran, Islamic Rep., Iraq, Israel, Kenya, Kiribati, Kyrgyz Republic, Lao PDR, Lesotho, Liberia, Libya, Madagascar, Malawi, Malaysia, Maldives, Mali, Mauritania, Mexico, Micronesia, Fed. Sts., Mongolia, Morocco, Mozambique, Namibia, Nepal, Nicaragua, Niger, Nigeria, Pakistan, Panama, Papua New Guinea, Paraguay, Peru, Philippines, Sao Tome and Principe, Saudi Arabia, Senegal, Sierra Leone, Singapore, Solomon Islands, Somalia, South Africa, South Sudan, Sudan, Syrian Arab Republic, Tajikistan, Tanzania, Timor-Leste, Togo, Tunisia, Turkiye, Turkmenistan, Tuvalu, Uganda, Uzbekistan, Vanuatu, Venezuela, RB, Viet Nam, Yemen, Rep., Zambia and Zimbabwe
  • Andorra, Djibouti, Jordan, Marshall Islands and Oman
  • Kuwait
  • Lebanon
  • Qatar and United Arab Emirates
  • Rwanda
  • Seychelles

Exercise 23

Use pivot_longer() to convert growth_clust to long format and plot growth as a function of time with a panel/facet for each cluster.

Use pivot_longer() to convert growth_clust to long format

growth_clust_long <- growth_clust |>
  pivot_longer(cols = `1961`:`2022`,  # Columns to pivot (year columns)
               names_to = "year",     
               values_to = "growth")
growth_clust_long

plot growth as a function of time with a panel/facet for each cluster.

# remember year is <chr> so it needs to be converted to numeric

ggplot(growth_clust_long, aes(x = as.numeric(year), y = growth, group = country)) +
  geom_line(alpha = 0.6) +
  facet_wrap(~cluster) + # Create a panel for each cluster
  labs(title = "Growth over time for each Cluster",
       x = "Year",
       y = "Year-to-year population growth in pct")

Exercise 24

The code below can be used to calculate average growth rates over several years.

agg_year <- 5 # define numbers of years to aggregate the growth rates

aggr_growth_long <- growth_long |>
  mutate(period=((end_year - min(end_year)) %/% agg_year) * agg_year + min(end_year)) |> 
  group_by(period, code, country, continent) |> # needed for the subsequent summarisation
  summarise(avg_growth = mean(growth)) # calculate the average growth for each group
`summarise()` has grouped output by 'period', 'code', 'country'. You can
override using the `.groups` argument.
aggr_growth_long
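
The period formula uses integer division (%/%) to map every end year to the first year of its block. A minimal sketch of the arithmetic on its own, assuming the end years run from 1961 to 2022 as in growth_long:

```r
agg_year <- 5
end_year <- 1961:2022   # the end years present in growth_long

# (years since the first end year) %/% 5 gives the block number 0, 1, 2, ...
# multiplying by 5 and adding the first end year gives the block's starting year
period <- ((end_year - min(end_year)) %/% agg_year) * agg_year + min(end_year)

unique(period)            # starting year of each 5-year block
table(period)[["2021"]]   # number of years that fall in the last block
```

The blocks start at 1961, 1966, and so on up to 2021; since 62 years do not divide evenly by 5, the last block contains only 2021 and 2022, so its average is based on two values instead of five.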

Use the clustering technique of the previous exercise to divide the data into different clusters. Experiment with several period lengths (aggregation years) and number of clusters and show results for at least one combination of agg_year and number of clusters. An example is given below.

Prepare the data for clustering

# Pivot the data to wide format
aggr_growth_wide <- aggr_growth_long |>
  select(code, country, period, avg_growth) |>
  pivot_wider(
    names_from = period,
    values_from = avg_growth
  )

head(aggr_growth_wide)

# Remove non-numerical columns.

aggr_growth_Numerical <- Filter(is.numeric, aggr_growth_wide) #only return numeric columns. 

head(aggr_growth_Numerical)
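An alternative to Filter() is the tidyverse idiom select(where(is.numeric)). The sketch below uses a hypothetical toy table (toy_wide is not the exam data) to show the idea:

```r
library(dplyr)
library(tibble)

# Hypothetical toy data with two identifier columns and two period columns
toy_wide <- tibble(
  code    = c("AAA", "BBB"),
  country = c("A", "B"),
  `1961`  = c(1.0, 2.0),
  `1966`  = c(1.5, 2.5)
)

toy_numeric <- toy_wide |>
  ungroup() |>                 # no effect here; relevant when the input is grouped
  select(where(is.numeric))    # keep only the numeric columns

toy_numeric
```

Both approaches should give the same columns; the ungroup() call matters for grouped data because select() otherwise keeps the grouping columns.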

Calculate distance

# Compute the distance matrix using Euclidean distance
distances_aggr_growth <- dist(aggr_growth_Numerical, method = "euclidean")

Cluster algorithm

hc <- hclust(distances_aggr_growth)
hc
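The pipeline dist() then hclust() then cutree() can be checked on data where the answer is obvious. The sketch below uses a hypothetical toy matrix with three low-growth and three high-growth rows; hclust() is called with method = "complete", which is also its default linkage.

```r
# Hypothetical toy data: three low-growth and three high-growth rows
toy <- rbind(
  c(0.1, 0.2, 0.1),
  c(0.2, 0.1, 0.2),
  c(0.1, 0.1, 0.3),
  c(3.0, 3.1, 2.9),
  c(3.2, 3.0, 3.1),
  c(2.9, 3.2, 3.0)
)
rownames(toy) <- paste0("country_", 1:6)

d      <- dist(toy, method = "euclidean")   # pairwise Euclidean distances between rows
hc_toy <- hclust(d, method = "complete")    # agglomerative clustering, complete linkage
groups <- cutree(hc_toy, k = 2)             # cut the tree into two clusters

groups
```

The first three rows end up in one cluster and the last three in the other. Other linkage methods such as "average" or "ward.D2" can be passed to the method argument and may give different clusters on real data.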

label the clusters

clusters <- cutree(hc, k = 9) # 9 groups

clusters
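Since the exercise asks for experiments with several numbers of clusters, one possible approach is to pass a vector of k values to cutree() and compare the cluster sizes. The sketch below uses random toy data as a stand-in for the wide growth table; the seed, the dimensions, and the chosen k values are arbitrary assumptions.

```r
set.seed(1)                               # arbitrary seed so the toy data is reproducible
toy <- matrix(rnorm(40 * 4), nrow = 40)   # hypothetical stand-in: 40 rows, 4 periods

hc_toy <- hclust(dist(toy))

ks <- c(3, 5, 9)
labels_by_k <- cutree(hc_toy, k = ks)     # one column of cluster labels per value of k

# Cluster sizes for each choice of k
sizes <- lapply(seq_along(ks), function(i) table(labels_by_k[, i]))
names(sizes) <- paste0("k = ", ks)

sizes
```

Many clusters with only one or two members, as with Afghanistan, Kuwait, and Seychelles in the result below, can indicate that the extra clusters mostly isolate outliers.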

add cluster labels to aggr_growth_wide

# To add the cluster labels to aggr_growth_wide, ungroup the data frame first so that each country (row) gets one cluster number instead of the labels being assigned per group.
aggr_growth_wide <- aggr_growth_wide |> ungroup()  

aggr_growth_clust <- aggr_growth_wide |>
  mutate(cluster = clusters) |>
  relocate(cluster, .after = country)

head(aggr_growth_clust)
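The ungroup() step is needed because summarise() keeps part of the grouping by default. As a sketch on hypothetical toy data (toy_long is not the exam data), the .groups = "drop" argument of summarise() returns an ungrouped tibble directly:

```r
library(dplyr)

# Hypothetical toy data: two countries, two periods, two yearly values per period
toy_long <- tibble(
  country = rep(c("A", "B"), each = 4),
  period  = rep(c(1961, 1961, 1966, 1966), times = 2),
  growth  = c(1, 2, 3, 4, 5, 6, 7, 8)
)

toy_avg <- toy_long |>
  group_by(period, country) |>
  summarise(avg_growth = mean(growth), .groups = "drop")  # drop all grouping

is_grouped_df(toy_avg)
```

Using .groups = "drop" also suppresses the message about grouped output; whether to change the code given in the exercise is a matter of preference.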

prepare for plot

aggr_growth_clust

aggr_growth_clust_long <- aggr_growth_clust |>
  pivot_longer(cols = `1961`:`2021`,  # Columns to pivot (period start years)
               names_to = "year",        # New column for the period start year
               values_to = "avg_growth") |> # New column for avg_growth values
  group_by(country)

head(aggr_growth_clust_long)

Plot the average growth of each country per aggregated period (5 years)

# group (not group_by) is the ggplot2 aesthetic that draws one line per country
ggplot(aggr_growth_clust_long, aes(as.numeric(year), avg_growth, group = country)) +
  geom_line(alpha = 0.2) +
  scale_x_continuous(breaks = seq(1960, 2022, by = 20)) +
  labs(title = "Average growth per 5-year period by country",
       x = "Year",
       y = "Average growth per 5-year period for each country")

extract the countries of each cluster and display

aggr_growth_clust |> 
  group_split(cluster) |> 
  lapply(function(x) x$country) |> 
  pander::pander()
  • Afghanistan
  • Angola, Burundi, Benin, Burkina Faso, Bangladesh, Bahrain, Bahamas, The, Belize, Bolivia, Brazil, Brunei Darussalam, Bhutan, Botswana, Central African Republic, Cote d’Ivoire, Cameroon, Congo, Dem. Rep., Congo, Rep., Colombia, Comoros, Cabo Verde, Costa Rica, Dominican Republic, Algeria, Ecuador, Egypt, Arab Rep., Eritrea, Ethiopia, Micronesia, Fed. Sts., Gabon, Ghana, Guinea, Gambia, The, Guinea-Bissau, Guatemala, Honduras, Haiti, Indonesia, India, Iran, Islamic Rep., Iraq, Israel, Kenya, Kyrgyz Republic, Cambodia, Kiribati, Lao PDR, Lebanon, Liberia, Libya, Lesotho, Morocco, Madagascar, Maldives, Mexico, Mali, Mongolia, Mozambique, Mauritania, Malawi, Malaysia, Namibia, Niger, Nigeria, Nicaragua, Nepal, Nauru, Pakistan, Panama, Peru, Philippines, Palau, Papua New Guinea, Paraguay, Rwanda, Saudi Arabia, Sudan, Senegal, Singapore, Solomon Islands, Sierra Leone, Somalia, South Sudan, Sao Tome and Principe, Suriname, Eswatini, Syrian Arab Republic, Chad, Togo, Tajikistan, Turkmenistan, Timor-Leste, Tunisia, Turkiye, Tanzania, Uganda, Uzbekistan, Venezuela, RB, Viet Nam, Vanuatu, Yemen, Rep., South Africa, Zambia and Zimbabwe
  • Albania, Argentina, Armenia, Antigua and Barbuda, Australia, Austria, Azerbaijan, Belgium, Belarus, Canada, Switzerland, Chile, China, Cuba, Cyprus, Dominica, Denmark, Spain, Fiji, France, United Kingdom, Grenada, Guyana, Ireland, Iceland, Jamaica, Japan, Kazakhstan, St. Kitts and Nevis, Korea, Rep., St. Lucia, Liechtenstein, Sri Lanka, Luxembourg, Monaco, North Macedonia, Malta, Myanmar, Mauritius, Netherlands, Norway, New Zealand, Poland, Korea, Dem. People’s Rep., Russian Federation, El Salvador, San Marino, Slovak Republic, Slovenia, Sweden, Thailand, Tonga, Trinidad and Tobago, Tuvalu, Uruguay, United States, St. Vincent and the Grenadines and Samoa
  • Andorra and Djibouti
  • United Arab Emirates and Qatar
  • Bulgaria, Bosnia and Herzegovina, Barbados, Czechia, Germany, Estonia, Finland, Georgia, Greece, Croatia, Hungary, Italy, Lithuania, Latvia, Moldova, Marshall Islands, Montenegro, Portugal, Romania, Serbia and Ukraine
  • Equatorial Guinea, Jordan and Oman
  • Kuwait
  • Seychelles

facet plot

# group (not group_by) is the ggplot2 aesthetic that draws one line per country
ggplot(aggr_growth_clust_long, aes(as.numeric(year), avg_growth, group = country)) +
  geom_line(alpha = 0.2) +
  facet_wrap(~cluster) + # Create a panel for each cluster
  labs(title = "Average growth per 5-year period by cluster",
       x = "Year",
       y = "Average growth per 5-year period")

Project dependencies

sessionInfo()
R version 4.4.1 (2024-06-14 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 10 x64 (build 19045)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Europe/Copenhagen
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] plotly_4.10.4   lubridate_1.9.3 forcats_1.0.0   stringr_1.5.1  
 [5] dplyr_1.1.4     purrr_1.0.2     tidyr_1.3.1     tibble_3.2.1   
 [9] ggplot2_3.5.1   tidyverse_2.0.0 rvest_1.0.4     readr_2.1.5    

loaded via a namespace (and not attached):
 [1] utf8_1.2.4        generics_0.1.3    xml2_1.3.6        stringi_1.8.4    
 [5] hms_1.1.3         digest_0.6.37     magrittr_2.0.3    evaluate_0.24.0  
 [9] grid_4.4.1        timechange_0.3.0  fastmap_1.2.0     rprojroot_2.0.4  
[13] jsonlite_1.8.8    httr_1.4.7        pander_0.6.5      selectr_0.4-2    
[17] fansi_1.0.6       crosstalk_1.2.1   viridisLite_0.4.2 scales_1.3.0     
[21] lazyeval_0.2.2    cli_3.6.3         crayon_1.5.3      rlang_1.1.4      
[25] bit64_4.0.5       munsell_0.5.1     withr_3.0.1       yaml_2.3.10      
[29] parallel_4.4.1    tools_4.4.1       tzdb_0.4.0        colorspace_2.1-1 
[33] here_1.0.1        curl_5.2.1        vctrs_0.6.5       R6_2.5.1         
[37] lifecycle_1.0.4   htmlwidgets_1.6.4 bit_4.0.5         vroom_1.6.5      
[41] pkgconfig_2.0.3   pillar_1.9.0      gtable_0.3.5      Rcpp_1.0.13      
[45] glue_1.7.0        data.table_1.15.4 xfun_0.47         tidyselect_1.2.1 
[49] rstudioapi_0.16.0 knitr_1.48        farver_2.1.2      htmltools_0.5.8.1
[53] labeling_0.4.3    rmarkdown_2.28    compiler_4.4.1